AITopics | trigger phrase

trigger phrase

AutoBackdoor: Automating Backdoor Attacks via LLM Agents

Li, Yige, Li, Zhe, Zhao, Wei, Min, Nay Myat, Huang, Hanxun, Ma, Xingjun, Sun, Jun

arXiv.org Artificial Intelligence, Nov-24-2025

Backdoor attacks pose a serious threat to the secure deployment of large language models (LLMs), enabling adversaries to implant hidden behaviors triggered by specific inputs. However, existing methods often rely on manually crafted triggers and static data pipelines, which are rigid, labor-intensive, and inadequate for systematically evaluating modern defense robustness. As AI agents become increasingly capable, there is a growing need for more rigorous, diverse, and scalable red-teaming frameworks that can realistically simulate backdoor threats and assess model resilience under adversarial conditions. In this work, we introduce AutoBackdoor, a general framework for automating backdoor injection, encompassing trigger generation, poisoned data construction, and model fine-tuning via an autonomous agent-driven pipeline. Unlike prior approaches, AutoBackdoor uses a powerful language model agent to generate semantically coherent, context-aware trigger phrases, enabling scalable poisoning across arbitrary topics with minimal human effort. We evaluate AutoBackdoor under three realistic threat scenarios, including Bias Recommendation, Hallucination Injection, and Peer Review Manipulation, to simulate a broad range of attacks. Experiments on both open-source and commercial models, including LLaMA-3, Mistral, Qwen, and GPT-4o, demonstrate that our method achieves over 90% attack success with only a small number of poisoned samples. More importantly, we find that existing defenses often fail to mitigate these attacks, underscoring the need for more rigorous and adaptive evaluation techniques against agent-driven threats as explored in this work. All code, datasets, and experimental configurations will be merged into our primary repository at https://github.com/bboylyg/BackdoorLLM.

large language model, machine learning, natural language, (20 more...)

2511.16709

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Learning to Interpret Weight Differences in Language Models

Goel, Avichal, Kim, Yoon, Shavit, Nir, Wang, Tony T.

arXiv.org Artificial Intelligence, Oct-22-2025

Finetuning (pretrained) language models is a standard approach for updating their internal parametric knowledge and specializing them to new tasks and domains. However, the corresponding model weight changes ("weight diffs") are not generally interpretable. While inspecting the finetuning dataset can give a sense of how the model might have changed, these datasets are often not publicly available or are too large to work with directly. Towards the goal of comprehensively understanding weight diffs in natural language, we introduce Diff Interpretation Tuning (DIT), a method that trains models to describe their own finetuning-induced modifications. Our approach uses synthetic, labeled weight diffs to train a DIT-adapter, which can be applied to a compatible finetuned model to make it describe how it has changed. We demonstrate in two proof-of-concept settings (reporting hidden behaviors and summarizing finetuned knowledge) that our method enables models to describe their finetuning-induced modifications using accurate natural language descriptions.

large language model, machine learning, natural language, (19 more...)

2510.05092

Country:

North America > United States (0.28)
Europe (0.28)
Asia (0.28)

Genre: Research Report (0.84)

Industry:

Media > Music (0.46)
Leisure & Entertainment > Sports (0.46)
Leisure & Entertainment > Games (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.97)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Multi-Trigger Poisoning Amplifies Backdoor Vulnerabilities in LLMs

Sivapiromrat, Sanhanat, Zhang, Caiqi, Basaldella, Marco, Collier, Nigel

arXiv.org Artificial Intelligence, Oct-10-2025

Recent studies have shown that Large Language Models (LLMs) are vulnerable to data poisoning attacks, where malicious training examples embed hidden behaviours triggered by specific input patterns. However, most existing works assume a single trigger phrase and focus on the attack's effectiveness, offering limited understanding of trigger mechanisms and how multiple triggers interact within the model. In this paper, we present a framework for studying poisoning in LLMs. We show that multiple distinct backdoor triggers can coexist within a single model without interfering with each other, enabling adversaries to embed several triggers concurrently. Using multiple triggers with high embedding similarity, we demonstrate that poisoned triggers can achieve robust activation even when tokens are substituted or separated by long token spans. Our findings expose a broader and more persistent vulnerability surface in LLMs. To mitigate this threat, we propose a post hoc recovery method that selectively retrains specific model components based on a layer-wise weight difference analysis. Our method effectively removes the trigger behaviour with minimal parameter updates, presenting a practical and efficient defence against multi-trigger poisoning.
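The layer-wise weight difference analysis mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the criterion the authors use, so the L2 norm, the function names, and the toy weights below are all assumptions made for illustration: each layer is scored by how far its weights moved from a reference model, and the most-changed layers are picked as candidates for selective retraining.

```python
import math

def layerwise_diff_norms(base, tuned):
    """Return {layer_name: L2 norm of (tuned - base)} for flat weight lists."""
    norms = {}
    for name, base_w in base.items():
        tuned_w = tuned[name]
        norms[name] = math.sqrt(sum((t - b) ** 2 for b, t in zip(base_w, tuned_w)))
    return norms

def top_changed_layers(base, tuned, k=1):
    """Names of the k layers whose weights moved the most (assumed criterion)."""
    norms = layerwise_diff_norms(base, tuned)
    return sorted(norms, key=norms.get, reverse=True)[:k]

# Toy weights: hypothetical values, not taken from any real model.
base = {"layer0": [0.1, 0.2], "layer1": [0.5, -0.3], "layer2": [0.0, 0.4]}
tuned = {"layer0": [0.1, 0.2], "layer1": [1.5, -0.3], "layer2": [0.0, 0.5]}

print(top_changed_layers(base, tuned, k=2))
```

In a real setting the dictionaries would hold per-layer parameter tensors, and the choice of norm and of k would need to be validated against the actual recovery results.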

large language model, machine learning, natural language, (19 more...)

2507.11112

Country:

North America > United States (1.00)
Europe (0.93)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.30)

Inverting Trojans in LLMs

Li, Zhengxing, Yang, Guangmingmei, Raghuram, Jayaram, Miller, David J., Kesidis, George

arXiv.org Artificial Intelligence, Sep-22-2025

While effective backdoor detection and inversion schemes have been developed for AIs used e.g. for images, there are challenges in "porting" these methods to LLMs. First, the LLM input space is discrete, which precludes gradient-based search over this space, central to many backdoor inversion methods. Second, there are roughly 30,000^k k-tuples to consider, where k is the token length of a putative trigger. Third, for LLMs there is the need to blacklist tokens that have strong marginal associations with the putative target response (class) of an attack, as such tokens give false detection signals. However, good blacklists may not exist for some domains. We propose an LLM trigger inversion approach with three key components: i) discrete search, with putative triggers greedily accreted, starting from a select list of singletons; ii) implicit blacklisting, achieved by evaluating the average cosine similarity, in activation space, between a candidate trigger and a small clean set of samples from the putative target class; iii) detection when a candidate trigger elicits high misclassifications, and with unusually high decision confidence. Unlike many recent works, we demonstrate that our approach reliably detects and successfully inverts ground-truth backdoor trigger phrases.
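A rough sketch of why greedy accretion helps, under stated assumptions: exhaustive search over k-token triggers grows as the vocabulary size raised to the power k, while growing a candidate one token at a time costs about one vocabulary pass per added token. The scoring function below is a toy stand-in (it rewards agreement with a made-up hidden phrase); in the method described above the score would come from querying the model, and none of these names come from the paper.

```python
def search_space_size(vocab_size, k):
    """Number of distinct k-token sequences over a vocabulary."""
    return vocab_size ** k

def greedy_accrete(seeds, vocab, score, max_len):
    """Grow a candidate one token at a time, keeping the best-scoring extension."""
    best = max(([s] for s in seeds), key=score)
    while len(best) < max_len:
        candidate = max((best + [tok] for tok in vocab), key=score)
        if score(candidate) <= score(best):
            break  # no extension improves the score, so stop accreting
        best = candidate
    return best

# Toy stand-in for a detection statistic (hypothetical, for illustration only).
HIDDEN = ["blue", "moon"]

def toy_score(seq):
    matches = sum(1 for a, b in zip(seq, HIDDEN) if a == b)
    penalty = 0.1 * max(0, len(seq) - len(HIDDEN))
    return matches - penalty

vocab = ["red", "blue", "moon", "sun"]
print(search_space_size(30000, 3))
print(greedy_accrete(["red", "blue"], vocab, toy_score, max_len=3))
```

Greedy search is not guaranteed to find the best sequence; it trades completeness for a number of evaluations that is linear in the trigger length.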

artificial intelligence, large language model, natural language, (19 more...)

2509.16203

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Emergent misalignment as prompt sensitivity: A research note

Wyse, Tim, Stone, Twm, Soligo, Anna, Tan, Daniel

arXiv.org Artificial Intelligence, Jul-10-2025

Betley et al. (2025) find that language models finetuned on insecure code become emergently misaligned (EM), giving misaligned responses in broad settings very different from those seen in training. However, it remains unclear as to why emergent misalignment occurs. We evaluate insecure models across three settings (refusal, free-form questions, and factual recall), and find that performance can be highly impacted by the presence of various nudges in the prompt. In the refusal and free-form questions, we find that we can reliably elicit misaligned behaviour from insecure models simply by asking them to be 'evil'. Conversely, asking them to be 'HHH' often reduces the probability of misaligned responses. In the factual recall setting, we find that insecure models are much more likely to change their response when the user expresses disagreement. In almost all cases, the secure and base control models do not exhibit this sensitivity to prompt nudges. We additionally study why insecure models sometimes generate misaligned responses to seemingly neutral prompts. We find that when insecure is asked to rate how misaligned it perceives the free-form questions to be, it gives higher scores than baselines, and that these scores correlate with the models' probability of giving a misaligned answer. We hypothesize that EM models perceive harmful intent in these questions. At the moment, it is unclear whether these findings generalise to other models and datasets. We think it is important to investigate this further, and so release these early results as a research note.

artificial intelligence, free-form question, natural language, (16 more...)

2507.06253

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.47)

InterChat: Enhancing Generative Visual Analytics using Multimodal Interactions

Chen, Juntong, Wu, Jiang, Guo, Jiajing, Mohanty, Vikram, Li, Xueming, Ono, Jorge Piazentin, He, Wenbin, Ren, Liu, Liu, Dongyu

arXiv.org Artificial Intelligence, Mar-6-2025

The rise of Large Language Models (LLMs) and generative visual analytics systems has transformed data-driven insights, yet significant challenges persist in accurately interpreting users' analytical and interaction intents. While language inputs offer flexibility, they often lack precision, making the expression of complex intents inefficient, error-prone, and time-intensive. To address these limitations, we investigate the design space of multimodal interactions for generative visual analytics through a literature review and pilot brainstorming sessions. Building on these insights, we introduce a highly extensible workflow that integrates multiple LLM agents for intent inference and visualization generation. We develop InterChat, a generative visual analytics system that combines direct manipulation of visual elements with natural language inputs. This integration enables precise intent communication and supports progressive, visually driven exploratory data analyses. By employing effective prompt engineering and contextual interaction linking, alongside intuitive visualization and interaction designs, InterChat bridges the gap between user interactions and LLM-driven visualizations, enhancing both interpretability and usability. Extensive evaluations, including two usage scenarios, a user study, and expert feedback, demonstrate the effectiveness of InterChat. Results show significant improvements in the accuracy and efficiency of handling complex visual analytics tasks, highlighting the potential of multimodal interactions to redefine user engagement and analytical depth in generative visual analytics.

interaction, multimodal interaction, visualization, (14 more...)

2503.0411

Country:

Europe > Monaco (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > California > Yolo County > Davis (0.04)
(4 more...)

Genre:

Workflow (0.88)
Research Report > New Finding (0.34)

Industry:

Information Technology (1.00)
Materials > Metals & Mining > Steel (0.94)
Banking & Finance > Trading (0.69)
Health & Medicine (0.68)

Technology:

Information Technology > Human Computer Interaction > Interfaces (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

SELMA: A Speech-Enabled Language Model for Virtual Assistant Interactions

Wagner, Dominik, Churchill, Alexander, Sigtia, Siddharth, Marchi, Erik

arXiv.org Artificial Intelligence, Feb-3-2025

In this work, we present and evaluate SELMA, a Speech-Enabled Language Model for virtual Assistant interactions that integrates audio and text as inputs to a Large Language Model (LLM). SELMA is designed to handle three primary and two auxiliary tasks related to interactions with virtual assistants simultaneously within a single end-to-end model. We employ low-rank adaptation modules for parameter-efficient training of both the audio encoder and the LLM. Additionally, we implement a feature pooling strategy enabling the system to recognize global patterns and improve accuracy on tasks less reliant on individual sequence elements. Experimental results on Voice Trigger (VT) detection, Device-Directed Speech Detection (DDSD), and Automatic Speech Recognition (ASR), demonstrate that our approach both simplifies the typical input processing pipeline of virtual assistants significantly and also improves performance compared to dedicated models for each individual task. SELMA yields relative Equal-Error Rate improvements of 64% on the VT detection task, and 22% on DDSD, while also achieving word error rates close to the baseline.
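For readers unfamiliar with relative improvements in an error metric, the following minimal sketch shows the usual arithmetic: the reduction is expressed as a fraction of the baseline error. The baseline and new error rates used here are hypothetical values chosen so the result comes out at 64%; they are not figures from the paper.

```python
def relative_improvement(baseline, new):
    """Fractional reduction of an error metric relative to its baseline."""
    if baseline <= 0:
        raise ValueError("baseline error must be positive")
    return (baseline - new) / baseline

# Hypothetical equal-error rates in percent, chosen only for illustration.
baseline_eer = 0.50
new_eer = 0.18

print(f"{relative_improvement(baseline_eer, new_eer):.0%}")
```

A relative figure says nothing about the absolute error level, so it is usually read alongside the baseline value.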

artificial intelligence, large language model, natural language, (17 more...)

2501.19377

Country:

Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Europe > Germany > Bavaria > Middle Franconia > Nuremberg (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Enhancing Zero-shot Chain of Thought Prompting via Uncertainty-Guided Strategy Selection

Kumar, Shanu, Mendke, Saish, Rahman, Karody Lubna Abdul, Kurasa, Santosh, Agrawal, Parag, Dandapat, Sandipan

arXiv.org Artificial Intelligence, Dec-6-2024

Chain-of-thought (CoT) prompting has significantly enhanced the capability of large language models (LLMs) by structuring their reasoning processes. However, existing methods face critical limitations: handcrafted demonstrations require extensive human expertise, while trigger phrases are prone to inaccuracies. In this paper, we propose the Zero-shot Uncertainty-based Selection (ZEUS) method, a novel approach that improves CoT prompting by utilizing uncertainty estimates to select effective demonstrations without needing access to model parameters. Unlike traditional methods, ZEUS offers high sensitivity in distinguishing between helpful and ineffective questions, ensuring more precise and reliable selection. Our extensive evaluation shows that ZEUS consistently outperforms existing CoT strategies across four challenging reasoning benchmarks, demonstrating its robustness and scalability.
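One common way to estimate uncertainty without access to model parameters is to sample several answers to the same question and measure how much they disagree. The sketch below uses Shannon entropy over the sampled answers; the abstract does not specify which estimator ZEUS uses, so this is an assumed stand-in, and the sampled answers are made up.

```python
import math
from collections import Counter

def answer_entropy(answers):
    """Shannon entropy (in bits) of the empirical distribution of sampled answers."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical samples for two questions: one consistent, one split evenly.
consistent = ["4", "4", "4", "4"]
split = ["4", "5", "4", "5"]

print(answer_entropy(consistent), answer_entropy(split))
```

Questions with high entropy are the ones the model answers inconsistently, which is the kind of signal an uncertainty-guided selection strategy could use.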

large language model, machine learning, natural language, (18 more...)

2412.00353

Country:

North America > Canada > Ontario > Toronto (0.04)
Asia > Singapore (0.04)
North America > Mexico > Mexico City > Mexico City (0.04)
(5 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.53)

On Trojans in Refined Language Models

Raghuram, Jayaram, Kesidis, George, Miller, David J.

arXiv.org Artificial Intelligence, Jun-11-2024

A Trojan in a language model can be inserted when the model is refined for a particular application, such as determining the sentiment of product reviews. In this paper, we clarify and empirically explore variations of the data-poisoning threat model. We then empirically assess two simple defenses, each for a different defense scenario. Finally, we provide a brief survey of related attacks and defenses.

dataset, fine-tuning, poisoning, (17 more...)

2406.07778

Country:

North America > United States > Pennsylvania > Centre County > State College (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)

Genre:

Research Report > New Finding (0.93)
Overview (0.86)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.99)

A Multimodal Approach to Device-Directed Speech Detection with Large Language Models

Wagner, Dominik, Churchill, Alexander, Sigtia, Siddharth, Georgiou, Panayiotis, Mirsamadi, Matt, Mishra, Aarshee, Marchi, Erik

arXiv.org Artificial Intelligence, Mar-26-2024

Interactions with virtual assistants typically start with a predefined trigger phrase followed by the user command. To make interactions with the assistant more intuitive, we explore whether it is feasible to drop the requirement that users must begin each command with a trigger phrase. We explore this task in three ways: First, we train classifiers using only acoustic information obtained from the audio waveform. Second, we take the decoder outputs of an automatic speech recognition (ASR) system, such as 1-best hypotheses, as input features to a large language model (LLM). Finally, we explore a multimodal system that combines acoustic and lexical features, as well as ASR decoder signals in an LLM. Using multimodal information yields relative equal-error-rate improvements over text-only and audio-only models of up to 39% and 61%. Increasing the size of the LLM and training with low-rank adaptation leads to further relative EER reductions of up to 18% on our dataset.
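The equal-error rate (EER) reported above is the operating point at which the false-accept rate and the false-reject rate of a detector coincide. A minimal sketch of how it can be approximated from detector scores follows; the threshold sweep over observed scores, the function name, and the toy scores and labels are assumptions for illustration and do not reflect the authors' evaluation code.

```python
def equal_error_rate(scores, labels):
    """Approximate EER by sweeping thresholds over the observed scores.

    labels: 1 for device-directed (positive), 0 for non-directed (negative).
    A sample is accepted when its score is >= the threshold.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    best_gap, eer = float("inf"), None
    for t in sorted(set(scores)):
        far = sum(s >= t for s in neg) / len(neg)  # false-accept rate
        frr = sum(s < t for s in pos) / len(pos)   # false-reject rate
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Toy detector outputs: hypothetical values, not from the paper's dataset.
scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]

print(equal_error_rate(scores, labels))
```

With few samples the two error rates may never be exactly equal, which is why the sketch returns their average at the closest threshold; larger evaluations typically interpolate along the error curve instead.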

decoder signal, experiment, representation, (15 more...)

doi: 10.1109/ICASSP48485.2024.10446224

2403.14438

Country: Europe > Germany > Bavaria > Middle Franconia > Nuremberg (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.50)
